527 research outputs found

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data

Author: Marioni JC
Richardson S
Vallejos CA
Publication venue: Public Library of Science (PLoS)
Publication date: 13/05/2015

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell's lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated with high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable support the efficacy of our approach.

Directory of Open Access Journals

PubMed Central

Spiral - Imperial College Digital Repository

FigShare

Structure and evolutionary history of a large family of NLR proteins in the zebrafish

Author: Howe K
Kondrashov F
Laird GK
Leptin M
Marioni JC
Schiffer PH
Soylemez O
Wiehe T
Zielinski J
Publication date: 01/04/2016

Multicellular eukaryotes have evolved a range of mechanisms for immune recognition. A widespread family involved in innate immunity are the NACHT-domain and leucine-rich-repeat-containing (NLR) proteins. Mammals have small numbers of NLR proteins, whereas in some species, mostly those without adaptive immune systems, NLRs have expanded into very large families. We describe a family of nearly 400 NLR proteins encoded in the zebrafish genome. The proteins share a defining overall structure, which arose in fishes after a fusion of the core NLR domains with a B30.2 domain, but can be subdivided into four groups based on their NACHT domains. Gene conversion acting differentially on the NACHT and B30.2 domains has shaped the family and created the groups. Evidence of positive selection in the B30.2 domain indicates that this domain rather than the leucine-rich repeats acts as the pathogen recognition module. In an unusual chromosomal organization, the majority of the genes are located on one chromosome arm, interspersed with other large multigene families, including a new family encoding zinc-finger proteins. The NLR-B30.2 proteins represent a new family with diversity in the specific recognition module that is present in fishes in spite of the parallel existence of an adaptive immune system.

UCL Discovery

Differential expression analysis for sequence count data

Author: A Agresti
A Mortazavi
AC Cameron
AM Smith
AS Morrissy
B Langmead
C Loader
CI Bliss
DD Licatalosi
G Robertson
GK Smyth
I Lönnstedt
J Bullard
JC Marioni
JF Lawless
JS Bloom
K Saha
L Wang
L Whitaker
M Kasowski
MD Robinson
P Engström
P McCullagh
RC Gentleman
Simon Anders
SJ Clark
U Nagalakshmi
Wolfgang Huber
Y Benjamini
Publication date: 01/01/2010

Motivation: High-throughput nucleotide sequencing provides quantitative readouts in assays for RNA expression (RNA-Seq), protein-DNA binding (ChIP-Seq) or cell counting (barcode sequencing). Statistical inference of differential signal in such data requires estimation of their variability throughout the dynamic range. When the number of replicates is small, error modelling is needed to achieve statistical power.

Results: We propose an error model that uses the negative binomial distribution, with variance and mean linked by local regression, to model the null distribution of the count data. The method controls type-I error and provides good detection power.

Availability: A free open-source R software package, DESeq, is available from the Bioconductor project and from http://www-huber.embl.de/users/anders/DESeq
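
The error model summarised in this abstract rests on the negative binomial distribution, whose variance exceeds its mean. The sketch below is not the DESeq implementation. It assumes the common parameterisation variance = mean + dispersion × mean², with a fixed, hypothetical dispersion value instead of one estimated by local regression, and it checks that relation numerically from the probability mass function. The function names are illustrative only.

```python
import math


def nb_variance(mean, dispersion):
    """Variance of a negative binomial count under the (assumed)
    parameterisation var = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


def nb_pmf(k, mean, dispersion):
    """Negative binomial probability of observing count k.

    Uses size = 1 / dispersion and p = size / (size + mean), computed in
    log space so that large counts do not overflow.
    """
    size = 1.0 / dispersion
    p = size / (size + mean)
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def moments_from_pmf(mean, dispersion, max_count=5000):
    """Mean and variance obtained by summing the pmf over 0..max_count-1."""
    probs = [nb_pmf(k, mean, dispersion) for k in range(max_count)]
    m = sum(k * pk for k, pk in enumerate(probs))
    v = sum((k - m) ** 2 * pk for k, pk in enumerate(probs))
    return m, v


if __name__ == "__main__":
    # Hypothetical gene: expected count 100, dispersion 0.1.
    m, v = moments_from_pmf(100.0, 0.1)
    print(m, v, nb_variance(100.0, 0.1))
```

With a dispersion of zero the variance would equal the mean, as in a Poisson model; the extra quadratic term is what lets the model absorb variability between replicates.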

Crossref

Springer

Springer - Publisher Connector

PubMed Central

Institute of Mathematics AS CR, v. v. i.

Nature Precedings

Impact of Alternative Splicing on the Human Proteome

Author: Aebersold R
Brazma A
Gonzàlez-Porta M
Liu Y
Marioni JC
Santos S
Venkitaraman AR
Wickramasinghe VO
Publication venue: Cell Reports
Publication date: 01/08/2017

Alternative splicing is a critical determinant of genome complexity and, by implication, is assumed to engender proteomic diversity. This notion has not been experimentally tested in a targeted, quantitative manner. Here, we have developed an integrative approach to ask whether perturbations in mRNA splicing patterns alter the composition of the proteome. We integrate RNA sequencing (RNA-seq) (to comprehensively report intron retention, differential transcript usage, and gene expression) with a data-independent acquisition (DIA) method, SWATH-MS (sequential window acquisition of all theoretical spectra-mass spectrometry), to capture an unbiased, quantitative snapshot of the impact of constitutive and alternative splicing events on the proteome. Whereas intron retention is accompanied by decreased protein abundance, alterations in differential transcript usage and gene expression alter protein abundance proportionate to transcript levels. Our findings illustrate how RNA splicing links isoform expression in the human transcriptome with proteomic diversity and provide a foundation for studying perturbations associated with human diseases.

We gratefully acknowledge funding from the EMBL (to M.G.-P. and J.C.M.), the NIH (U01CA152813 to Y.S.L. and R.A.), the ERC (AdG-670821 [Proteomics 4D] to R.A.), the Swiss National Science Foundation (31003A_166435 to R.A.), SystemsX.ch through project PhosphonetX-PPM (to R.A.), the UK Medical Research Council (G1001521, G1001522, and 4050551988 to A.R.V.), and the NHMRC (1127745 to V.O.W.). V.O.W. is supported by an innovation fellowship from VESKI.

Repository for Publications and Research Data

Directory of Open Access Journals

Apollo (Cambridge)

Stella modulates transcriptional and endogenous retrovirus programs during maternal-to-zygotic transition

Author: Do DV
Hackett JA
Huang Y
Kim JK
Lee C
Marioni JC
Penfold CA
Surani MA
Zylicz JJ
Publication venue: eLife
Publication date: 01/03/2017

The maternal-to-zygotic transition (MZT) marks the period when the embryonic genome is activated and acquires control of development. Maternally inherited factors play a key role in this critical developmental process, which occurs at the 2-cell stage in mice. We investigated the function of the maternally inherited factor Stella (encoded by Dppa3) using single-cell/embryo approaches. We show that loss of maternal Stella results in widespread transcriptional mis-regulation and a partial failure of MZT. Strikingly, activation of endogenous retroviruses (ERVs) is significantly impaired in Stella maternal/zygotic knockout embryos, which in turn leads to a failure to upregulate chimeric transcripts. Amongst ERVs, MuERV-L activation is particularly affected by the absence of Stella, and direct in vivo knockdown of MuERV-L impacts the developmental potential of the embryo. We propose that Stella is involved in ensuring activation of ERVs, which themselves play a potentially key role during early development, either directly or through influencing embryonic gene expression.

The work was funded by a studentship to YH from the James Baird Fund, University of Cambridge, by the DGIST Start-up Fund of the Ministry of Science, ICT and Future Planning to JKK, by a core grant from EMBL and CRUK to JCM, by a Wellcome Trust Senior Investigator Award to MAS, and by a core grant from the Wellcome Trust and Cancer Research UK to the Gurdon Institute.

Apollo (Cambridge)

DGIST Library Institutional Repository

Common Genetic Variants Explain the Majority of the Correlation Between Height and Intelligence: The Generation Scotland Study

Author: Archie Campbell
B Benyamin
BH Smith
C Gale
Caroline Hayward
CM Calvin
CM Haworth
CM Lee
D Wechsler
David J. Porteous
E Whitley
G Davies
G. David Batty
GD Batty
Ian J. Deary
IJ Deary
J Yang
JC Raven
JM Starr
JM Sundet
JP Beauchamp
K Silventoinen
KL Gunderson
LJ Whalley
Lynne J. Hocking
M Trzaskowski
MC Keller
MD Lezak
Peter M. Visscher
Riccardo E. Marioni
S Macgregor
SH Lee
Shona M. Kerr
SM Kerr
TA Paajanen
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Creative Commons Attribution License. Peer reviewed. Publisher PD

Aberdeen University Research

Crossref

Springer - Publisher Connector

PubMed Central

Edinburgh Research Explorer

University of Queensland eSpace

Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments

Author: A Lee
A Mortazavi
A Oshlack
B Ewing
B Langmead
DR Bentley
DY Chiang
Elizabeth Purdom
ET Wang
H Li
Illumina
J Lu
James H Bullard
JC Dohm
JC Marioni
Kasper D Hansen
MA Taub
MAQC Consortium
MD Robinson
PAC Hoen
RA Irizarry
RD Canales
S Durinck
Sandrine Dudoit
U Nagalakshmi
Publication venue: BioMed Central
Publication date: 21/04/2009

Background: High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data.

Results: We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find that there are substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as varying gene lengths, base-calling calibration method (with and without phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection.

Conclusions: Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference, to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research in the development of statistical and computational methods for mRNA-Seq.
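
The contrast this abstract draws between scaling by total lane counts and quantile-based scaling can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the procedure from the paper: it scales each lane by the upper quartile of its non-zero counts, and the function names and count values are hypothetical.

```python
def total_count_factors(lanes):
    """Scaling factor per lane: lane total divided by the mean lane total."""
    totals = [sum(lane) for lane in lanes]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def upper_quartile(values):
    """75th percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def upper_quartile_factors(lanes):
    """Scaling factor per lane: upper quartile of the lane's non-zero
    counts divided by the mean of those upper quartiles."""
    quartiles = [upper_quartile([c for c in lane if c > 0]) for lane in lanes]
    mean_q = sum(quartiles) / len(quartiles)
    return [q / mean_q for q in quartiles]


if __name__ == "__main__":
    # Two hypothetical lanes with identical counts for eight genes;
    # only the last gene differs, and it is very highly expressed in lane_b.
    lane_a = [10, 20, 30, 40, 50, 60, 70, 80, 100]
    lane_b = [10, 20, 30, 40, 50, 60, 70, 80, 10000]
    print(total_count_factors([lane_a, lane_b]))
    print(upper_quartile_factors([lane_a, lane_b]))
```

In this toy case the single highly expressed gene dominates the lane totals, so total-count scaling would make the eight unchanged genes look different between lanes, while the upper-quartile factors stay equal.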

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Collection Of Biostatistics Research Archive

iGepros: an integrated gene and protein annotation server for biological nature exploration

Author: A Conesa
A Subramanian
BT Sherman
Chaochun Wei
D Hartl
Guangyong Zheng
Haibo Wang
J Peng
JC Marioni
M Schena
MA Behr
Q Zheng
R Aebersold
RC Gentleman
S Bauer
S Carbon
S Falcon
Y Moriya
Yixue Li
Z Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: In the post-genomic era, transcriptomics and proteomics provide important information for understanding genomes. With the fast development of high-throughput technology, transcriptomics and proteomics data are being generated at an unprecedented rate. This creates a need for software to annotate these omics data and explore their biological nature. In the past decade, some pioneering tools were presented to address this issue, but limitations remain. For example, some of these tools offer only a command-line interface, which is not suitable for users with little or no programming experience. In addition, some tools do not support large-scale gene and protein analysis.

Results: To overcome these limitations, an integrated gene and protein annotation server named iGepros has been developed. The server provides user-friendly interfaces and detailed on-line examples, so most researchers, even those with little or no programming experience, can use it smoothly. Moreover, the server provides many functionalities to compare transcriptomics and proteomics data. Notably, the server is built on a model-view-controller framework, which makes it easy to add functions in the future.

Conclusions: In this paper, we present a server with powerful capabilities not only for gene and protein functional annotation, but also for transcriptomics and proteomics data comparison. By applying the server, researchers can survey the biological characteristics behind gene and protein datasets and accelerate their investigation of the transcriptome and proteome. The server is publicly available at http://www.biosino.org/iGepros/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ExpressionPlot: a web-based framework for analysis of RNA-Seq and microarray gene expression data

Author: A Mortazavi
A Oshlack
B Langmead
Brad A Friedman
C Trapnell
ET Wang
GK Smyth
H Li
J Bullard
J Goecks
JC Marioni
JT Robinson
M Reich
MD Robinson
S Akira
S Anders
TJP Hubbard
Tom Maniatis
U Nagalakshmi
Y Katz
Z Wu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/12/2010

RNA-Seq and microarray platforms have emerged as important tools for detecting changes in gene expression and RNA processing in biological samples. We present ExpressionPlot, a software package consisting of a default back end, which prepares raw sequencing or Affymetrix microarray data, and a web-based front end, which offers a biologically centered interface to browse, visualize, and compare different data sets. Download and installation instructions, a user's manual, discussion group, and a prototype are available at http://expressionplot.com/. ALS Therapy Allianc

DSpace@MIT

Crossref

Columbia University Academic Commons

PubMed Central

Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform

Author: A Baross
BM Bolstad
DA Oldridge
DG Altman
Elizabeth J Atkinson
H Bengtsson
JC Marioni
Jeanette E Eckel-Passow
K Wang
L Winchester
Mariza de Andrade
RA Irizarry
RB Scharpf
Sharon LR Kardia
SJ Diskin
Sooraj Maharjan
The 1000 Genomes Project Consortium
WR Lai
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Copy number data are routinely being extracted from genome-wide association study chips using a variety of software. We empirically evaluated and compared four freely-available software packages designed for Affymetrix SNP chips to estimate copy number: Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM. Our evaluation used 1,418 GENOA samples that were genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0. We compared bias and variance in the locus-level copy number data, the concordance amongst regions of copy number gains/deletions and the false-positive rate amongst deleted segments.

Results: APT had median locus-level copy numbers closest to a value of two, whereas PennCNV and Aroma.Affymetrix had the smallest variability associated with the median copy number. Of those evaluated, only PennCNV provides copy number specific quality-control metrics and identified 136 poor CNV samples. Regions of copy number variation (CNV) were detected using the hidden Markov models provided within PennCNV and CRLMM/VanillaIce. PennCNV detected more CNVs than CRLMM/VanillaIce; the median number of CNVs detected per sample was 39 and 30, respectively. PennCNV detected most of the regions that CRLMM/VanillaIce did as well as additional CNV regions. The median concordance between PennCNV and CRLMM/VanillaIce was 47.9% for duplications and 51.5% for deletions. The estimated false-positive rate associated with deletions was similar for PennCNV and CRLMM/VanillaIce.

Conclusions: If the objective is to perform statistical tests on the locus-level copy number data, our empirical results suggest that PennCNV or Aroma.Affymetrix is optimal. If the objective is to perform statistical tests on the summarized segmented data then PennCNV would be preferred over CRLMM/VanillaIce. Specifically, PennCNV allows the analyst to estimate locus-level copy number, perform segmentation and evaluate CNV-specific quality-control metrics within a single software package. PennCNV has relatively small bias, small variability and detects more regions while maintaining a similar estimated false-positive rate as CRLMM/VanillaIce. More generally, we advocate that software developers need to provide guidance with respect to evaluating and choosing optimal settings in order to obtain optimal results for an individual dataset. Until such guidance exists, we recommend trying multiple algorithms, evaluating concordance/discordance and subsequently considering the union of regions for downstream association tests.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Deep Blue Documents at the University of Michigan